{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Custom Entity detection with Textract and Comprehend\n",
    "\n",
    "## Contents\n",
    "1. [Background](#Background)\n",
    "1. [Setup](#Setup)\n",
    "1. [Data Prep](#Data-Prep)\n",
    "1. [Textract OCR++](#Textract-OCR++)\n",
    "1. [Amazon GroundTruth Labeling](#Amazon-GroundTruth-Labeling)\n",
    "1. [Comprehend Custom Entity Training](#Comprehend-Custom-Entity-Training)\n",
    "1. [Model Performance](#Model-Performance)\n",
    "1. [Inference](#Inference)\n",
    "1. [Results](#Results)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Background\n",
    "\n",
    "In this notebook, we will cover how to extract and build a custom entity recognizer using Amazon Textract and Comprehend. We will be using Amazon Textract to perform OCR++ on scanned document, GroundTruth to label the interested entities, then passing the extracted documents to Amazon Comprehend to build and train a custom entity recognition model. No prior machine learning knowledge is required. \n",
    "\n",
    "In this example, We are using a public dataset from Kaggle: [Resume Entities for NER](https://www.kaggle.com/dataturks/resume-entities-for-ner?select=Entity+Recognition+in+Resumes.json). The dataset comprised 220 samples of candidate resumes in JSON format. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "_This Notebook was created on ml.t2.medium notebook instances._\n",
    "\n",
    "Let's start by install and import all neccessary libaries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Installing tqdm Python Library\n",
    "!pip install tqdm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "import logging\n",
    "import boto3\n",
    "import glob\n",
    "import time\n",
    "import os \n",
    "from tqdm import tqdm\n",
    "import json\n",
    "\n",
    "\n",
    "region = boto3.Session().region_name    \n",
    "role = sagemaker.get_execution_role()\n",
    "bucket = sagemaker.Session().default_bucket()\n",
    "prefix = 'textract_comprehend_NER'\n",
    "\n",
    "iam = boto3.client('iam')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create Data Access Role for Comprehend"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Policy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "assume_role_policy_doc = {\n",
    "  \"Version\": \"2012-10-17\",\n",
    "  \"Statement\": [\n",
    "    {\n",
    "      \"Effect\": \"Allow\",\n",
    "      \"Principal\": {\n",
    "        \"Service\": \"comprehend.amazonaws.com\"\n",
    "      },\n",
    "      \"Action\": \"sts:AssumeRole\"\n",
    "    }\n",
    "  ]\n",
    "} "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create Role and Attach Policies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iam_textract_comprehend_role_name = 'DSOAWS_Textract_Comprehend'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import time\n",
    "\n",
    "from botocore.exceptions import ClientError\n",
    "\n",
    "try:\n",
    "    iam_role_textract_comprehend = iam.create_role(\n",
    "        RoleName=iam_textract_comprehend_role_name,\n",
    "        AssumeRolePolicyDocument=json.dumps(assume_role_policy_doc),\n",
    "        Description='DSOAWS Textract Comprehend Role'\n",
    "    )\n",
    "except ClientError as e:\n",
    "    if e.response['Error']['Code'] == 'EntityAlreadyExists':\n",
    "        iam_role_textract_comprehend = iam.get_role(RoleName=iam_textract_comprehend_role_name)\n",
    "        print(\"Role already exists\")\n",
    "    else:\n",
    "        print(\"Unexpected error: %s\" % e)\n",
    "        \n",
    "time.sleep(30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "textract_comprehend_s3_policy_doc = {\n",
    "    \"Version\": \"2012-10-17\",\n",
    "    \"Statement\": [\n",
    "        {\n",
    "            \"Action\": [\n",
    "                \"s3:GetObject\"\n",
    "            ],\n",
    "            \"Resource\": [\n",
    "                \"arn:aws:s3:::{}/*\".format(bucket)\n",
    "            ],\n",
    "            \"Effect\": \"Allow\"\n",
    "        },\n",
    "        {\n",
    "            \"Action\": [\n",
    "                \"s3:ListBucket\"\n",
    "            ],\n",
    "            \"Resource\": [\n",
    "                \"arn:aws:s3:::{}\".format(bucket)\n",
    "            ],\n",
    "            \"Effect\": \"Allow\"\n",
    "        },\n",
    "        {\n",
    "            \"Action\": [\n",
    "                \"s3:PutObject\"\n",
    "            ],\n",
    "            \"Resource\": [\n",
    "                \"arn:aws:s3:::{}/*\".format(bucket)\n",
    "            ],\n",
    "            \"Effect\": \"Allow\"\n",
    "        }\n",
    "    ]\n",
    "}\n",
    "\n",
    "print(textract_comprehend_s3_policy_doc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Attach Policy to Role"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "response = iam.put_role_policy(\n",
    "    RoleName=iam_textract_comprehend_role_name,\n",
    "    PolicyName='DSOAWS_ComprehendPolicyToS3',\n",
    "    PolicyDocument=json.dumps(textract_comprehend_s3_policy_doc)\n",
    ")\n",
    "\n",
    "print(response)\n",
    "\n",
    "time.sleep(30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iam_role_textract_comprehend_arn = iam_role_textract_comprehend['Role']['Arn']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Prep <a class=\"anchor\" id=\"Data-Prep\"></a>\n",
    "\n",
    "PDF and PNG are most common format for scanned documents within enterprises. We already converted these resumes into PDF format to emulate this. Let's upload all these PDF resumes onto S3 for Textract processing. Please note, there are only 220 samples of resume inside the dataset. By modern standards, this is a very small dataset. This dataset also come with few labeled custom entities. However, we will be running this dataset through Amazon GroundTruth to obtain a fresh copy of entity list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Uploading PDF resumes to S3\n",
    "pdfResumeFileList = glob.glob(\"./resume_pdf/*.pdf\")\n",
    "prefix_resume_pdf = prefix + \"/resume_pdf/\"\n",
    "\n",
    "for filePath in tqdm(pdfResumeFileList):\n",
    "    file_name = os.path.basename(filePath)\n",
    "    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)\n",
    "\n",
    "resume_pdf_bucket_name = 's3://'+bucket+'/'+prefix+'/'+'resume_pdf/'\n",
    "print('Uploaded Resume PDFs :\\t', resume_pdf_bucket_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Textract OCR++ <a class=\"anchor\" id=\"Textract-OCR++\"></a>\n",
    "\n",
    "Now these PDFs are ready for Textract to perform OCR++, you can kick off the process with [StartDocumentTextDetection](https://docs.aws.amazon.com/textract/latest/dg/API_StartDocumentTextDetection.html) async API cal. Here we are only set to process 2 resume PDF for demonstrating the process. To save time, we have all 220 resumes processed and avaliable for you. See textract_output directory for all the reuslts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s3_client = boto3.client('s3')\n",
    "pdf_object_list = []\n",
    "\n",
    "# Getting a list of resume PDF files:\n",
    "response = s3_client.list_objects(\n",
    "    Bucket= bucket,\n",
    "    Prefix= prefix+'/'+'resume_pdf/text_output'\n",
    ")\n",
    "\n",
    "for obj in response['Contents']:\n",
    "    pdf_object_list.append(obj['Key'])\n",
    "\n",
    "pdf_object_list[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "from s3_textract_functions import *\n",
    "import codecs\n",
    "\n",
    "sample_to_process = 2\n",
    "\n",
    "# We are only processing few files as example; You do not need to process all 220 files\n",
    "for file_obj in tqdm(pdf_object_list[:sample_to_process]):\n",
    "    print('Textract Processing PDF: \\t'+ file_obj)             \n",
    "    job_id = StartDocumentTextDetection(bucket, file_obj)\n",
    "    print('Textract Job Submitted: \\t'+ job_id)\n",
    "    response = getDocumentTextDetection(job_id)\n",
    "    \n",
    "    # renaming .pdf to .text\n",
    "    text_output_name = file_obj.replace('.pdf', '.txt')\n",
    "    text_output_name = text_output_name[(text_output_name.rfind('/')+1):]\n",
    "    print('Output Name:\\t', text_output_name)\n",
    "    \n",
    "    output_dir = './textract_output/'\n",
    "    \n",
    "    # Writing Textract Output to Text Files:\n",
    "    with codecs.open(output_dir + text_output_name, \"w\", \"utf-8\") as output_file:\n",
    "        for item in response[\"Blocks\"]:\n",
    "            if item[\"BlockType\"] == \"LINE\":\n",
    "                print('\\033[94m' + item[\"Text\"] + '\\033[0m')\n",
    "                output_file.write(item[\"Text\"]+'\\n')\n",
    "    output_file.close()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "    \n",
    "# Uploading Textract Output to S3\n",
    "textract_output_filelist = glob.glob(\"./textract_output/*.txt\")\n",
    "prefix_textract_output = prefix + \"/textract_output/\"\n",
    "\n",
    "for filePath in tqdm(textract_output_filelist):\n",
    "    file_name = os.path.basename(filePath)\n",
    "    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_textract_output, file_name)).upload_file(filePath)\n",
    "\n",
    "comprehend_input_doucuments = 's3://' + bucket+'/'+prefix_textract_output\n",
    "print('Textract Output:\\t', comprehend_input_doucuments)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Amazon GroundTruth Labeling <a class=\"anchor\" id=\"Amazon-GroundTruth-Labeling\"></a>\n",
    "\n",
    "Since we need to train a custom entity recognition model with Comprehend, and with any machine learning models, we need large amount of training data. In this example, we are leveraging Amazon GroundTruth to label our entities. Amazon Comprehend by default already can recognize entities like [Person, Title, Organization, and etc](https://docs.aws.amazon.com/comprehend/latest/dg/how-entities.html). To demonstrate custom entity recognition capability, we are focusing on Skill entities inside these resumes. We have the labeled and cleaned the data with Amazon GroundTruth (see: entity_list.csv). If you are interested, you can follow this blog to [add data labeling workflow for named entity recognition](https://aws.amazon.com/blogs/machine-learning/adding-a-data-labeling-workflow-for-named-entity-recognition-with-amazon-sagemaker-ground-truth/). \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we start training, let's upload the entity list onto S3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uploading Entity List to S3\n",
    "entity_list_file = './entity_list.csv'\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix+'/entity_list/', 'entity_list.csv')).upload_file(entity_list_file)\n",
    "\n",
    "comprehend_input_entity_list = 's3://' + bucket+'/'+prefix+'/entity_list/'+'entity_list.csv'\n",
    "print('Entity List:\\t', comprehend_input_entity_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comprehend Custom Entity Training <a class=\"anchor\" id=\"Comprehend-Custom-Entity-Training\"></a>\n",
    "\n",
    "Now we have both raw and labeled data, and ready to train our model. You can kick off the process with create_entity_recognizer API call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "comprehend_client = boto3.client('comprehend')\n",
    "\n",
    "custom_recognizer_name = 'resume-entity-recognizer-'+ str(int(time.time()))\n",
    "\n",
    "comprehend_custom_recognizer_response = comprehend_client.create_entity_recognizer(\n",
    "    RecognizerName = custom_recognizer_name,\n",
    "    DataAccessRoleArn=iam_role_textract_comprehend_arn,\n",
    "    InputDataConfig={\n",
    "        'EntityTypes': [\n",
    "            {\n",
    "                'Type': 'SKILLS'\n",
    "            },\n",
    "        ],\n",
    "        'Documents': {\n",
    "            'S3Uri': comprehend_input_doucuments\n",
    "        },\n",
    "        'EntityList': {\n",
    "            'S3Uri': comprehend_input_entity_list\n",
    "        }\n",
    "    },\n",
    "    LanguageCode='en'\n",
    ")\n",
    "\n",
    "print(json.dumps(comprehend_custom_recognizer_response, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the training job is submitted, you can see the recognizer is being trained on Comprehend Console. \n",
    "This will take approxiamately 20 minutes to train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Performance <a class=\"anchor\" id=\"Model-Performance\"></a>\n",
    "\n",
    "In the training, Comprehend will divide the dataset into training documents and test documents. Once the recognizer is trained, you can see the recognizer’s overall performance, as well as the performance for each entity. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "comprehend_model_response = comprehend_client.describe_entity_recognizer(\n",
    "    EntityRecognizerArn=comprehend_custom_recognizer_response['EntityRecognizerArn']\n",
    ")\n",
    "status = comprehend_model_response['EntityRecognizerProperties']['Status']\n",
    "print('Training Job Status:\\t', status)\n",
    "\n",
    "while status != 'TRAINED':\n",
    "    comprehend_model_response = comprehend_client.describe_entity_recognizer(\n",
    "        EntityRecognizerArn=comprehend_custom_recognizer_response['EntityRecognizerArn']\n",
    "    )\n",
    "    status = comprehend_model_response['EntityRecognizerProperties']['Status']\n",
    "    print(status)\n",
    "    time.sleep(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Number of Document Trained:\\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['NumberOfTrainedDocuments'])\n",
    "print('Number of Document Tested:\\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['NumberOfTestDocuments'])\n",
    "print('\\n-------------- Evaluation Metrics: ----------------')\n",
    "print('Precision:\\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['Precision'])\n",
    "print('ReCall:\\t\\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['Recall'])\n",
    "print('F1 Score:\\t', comprehend_model_response['EntityRecognizerProperties']['RecognizerMetadata']['EvaluationMetrics']['F1Score'])    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inference\n",
    "\n",
    "Next, we have prepared a small sample of text to test out our newly trained custom entity recognizer. First, we will upload the document onto S3 and start a custom recognizer job. Once the job is submitted, you can see the progress in console under `Amazon Comprehend` ==> `Analysis Jobs` "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Uploading Test PDF resumes to S3 for OCR++"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdfResumeFileList = glob.glob(\"./test_document/*.pdf\")\n",
    "prefix_resume_pdf = prefix + \"/test_document/\"\n",
    "\n",
    "for filePath in tqdm(pdfResumeFileList):\n",
    "    file_name = os.path.basename(filePath)\n",
    "    boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix_resume_pdf, file_name)).upload_file(filePath)\n",
    "\n",
    "resume_pdf_bucket_name = 's3://'+bucket+'/'+prefix+'/'+'test_document/'\n",
    "print('Uploaded Resume PDFs :\\t', resume_pdf_bucket_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Performing OCR++ Using Textract"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pdf_object_list = []\n",
    "pdf_object_list.append(prefix_resume_pdf+\"test_document.pdf\")\n",
    "\n",
    "output_dir = './test_document/'\n",
    "\n",
    "for file_obj in tqdm(pdf_object_list):\n",
    "    print('Textract Processing PDF: \\t'+ file_obj)             \n",
    "    job_id = StartDocumentTextDetection(bucket, file_obj)\n",
    "    print('Textract Job Submitted: \\t'+ job_id)\n",
    "    response = getDocumentTextDetection(job_id)\n",
    "    \n",
    "    # renaming .pdf to .text\n",
    "    text_output_name = file_obj.replace('.pdf', '.txt')\n",
    "    text_output_name = text_output_name[(text_output_name.rfind('/')+1):]\n",
    "    print('Output Name:\\t', text_output_name)\n",
    "    \n",
    "    \n",
    "    # Writing Textract Output to Text Files:\n",
    "    with codecs.open(output_dir + text_output_name, \"w\", \"utf-8\") as output_file:\n",
    "        for item in response[\"Blocks\"]:\n",
    "            if item[\"BlockType\"] == \"LINE\":\n",
    "                print('\\033[94m' + item[\"Text\"] + '\\033[0m')\n",
    "                output_file.write(item[\"Text\"]+'\\n')\n",
    "    output_file.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Uploading the Textract Result for Inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uploading test document onto S3:\n",
    "test_document = './test_document/test_document.txt'\n",
    "boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix+'/test_document/', 'test_document.txt')).upload_file(test_document)\n",
    "\n",
    "s3_test_document = 's3://' + bucket+'/'+prefix+'/test_document/'+'test_document.txt'\n",
    "s3_test_document_output = 's3://' + bucket+'/'+prefix+'/test_document/'\n",
    "print('Test Document Input: ', s3_test_document)\n",
    "print('Test Document Output: ', s3_test_document_output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls s3://sagemaker-us-east-1-835319576252/textract_comprehend_NER/entity_list/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Start a recognizer Job:\n",
    "custom_recognizer_job_name = 'recognizer-job-'+ str(int(time.time()))\n",
    "\n",
    "recognizer_response = comprehend_client.start_entities_detection_job(\n",
    "    InputDataConfig={\n",
    "        'S3Uri': s3_test_document,\n",
    "        'InputFormat': 'ONE_DOC_PER_LINE'\n",
    "    },\n",
    "    OutputDataConfig={\n",
    "        'S3Uri': s3_test_document_output\n",
    "    },\n",
    "    DataAccessRoleArn=iam_role_textract_comprehend_arn,\n",
    "    JobName=custom_recognizer_job_name,\n",
    "    EntityRecognizerArn=comprehend_model_response['EntityRecognizerProperties']['EntityRecognizerArn'],\n",
    "    LanguageCode='en'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use follow code to check if the Detection Job for completion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "job_response = comprehend_client.describe_entities_detection_job(\n",
    "    JobId=recognizer_response['JobId']\n",
    ")\n",
    "\n",
    "status = job_response['EntitiesDetectionJobProperties']['JobStatus']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "while status != 'COMPLETED':\n",
    "#    if status == 'FAILED':\n",
    "#        exit\n",
    "    job_response = comprehend_client.describe_entities_detection_job(\n",
    "        JobId=recognizer_response['JobId']\n",
    "    )\n",
    "    status = job_response['EntitiesDetectionJobProperties']['JobStatus']\n",
    "    print(status)\n",
    "    time.sleep(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Detection Job Name:\\t', job_response['EntitiesDetectionJobProperties']['JobName'])\n",
    "print('Detection Job ID:\\t', job_response['EntitiesDetectionJobProperties']['JobId'])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Results\n",
    "\n",
    "Once the Analysis job is done, you can download the output and see the results. Here we converted the json result into table format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "output_url = job_response['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']\n",
    "\n",
    "print('S3 Output URL:\\t', output_url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.parse import urlparse\n",
    "\n",
    "#create dir for output file:\n",
    "!mkdir test_document_output\n",
    "\n",
    "# Downloading Output File\n",
    "if job_response['EntitiesDetectionJobProperties']['JobStatus'] == 'COMPLETED':\n",
    "    filename = './test_document_output/output.tar.gz'\n",
    "    output_url_o = urlparse(output_url, allow_fragments=False)\n",
    "    s3_client.download_file(output_url_o.netloc, output_url_o.path.lstrip('/'), filename)\n",
    "\n",
    "    !cd test_document_output; tar -xvzf output.tar.gz\n",
    "    \n",
    "    print(\"Output downloaded ... \")\n",
    "else:\n",
    "    print(\"Analysis job did not finish successfully.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import HTML, display\n",
    "\n",
    "output_file_name = './test_document_output/output'\n",
    "data = [['Start Offset', 'End Offset', 'Confidence', 'Text', 'Type']]\n",
    "\n",
    "with open(output_file_name, 'r', encoding='utf-8') as input_file:\n",
    "    for line in input_file.readlines():\n",
    "        json_line = json.loads(line)  # converting line of text into JSON\n",
    "        entities = json_line['Entities']\n",
    "        if(len(entities)>0):\n",
    "            for entry in entities:\n",
    "                entry_data = [entry['BeginOffset'], entry['EndOffset'], entry['Score'], entry['Text'],entry['Type']]\n",
    "                data.append(entry_data)\n",
    "        \n",
    "display(HTML(\n",
    "   '<table><tr>{}</tr></table>'.format(\n",
    "       '</tr><tr>'.join(\n",
    "           '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)\n",
    "       )\n",
    "))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
